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Abstract 

A metric graph is a 1-dimensional stratified metric space consisting of vertices and edges 
or loops glued together. Metric graphs can be naturally used to represent and model data 
that take the form of noisy filamentary structures, such as street maps, neurons, networks 
of rivers and galaxies. We consider the statistical problem of reconstructing the topology 
of a metric graph from a random sample. We derive a lower bound on the minimax risk for 
the noiseless case and an upper bound for the special case of metric graphs embedded in 



?2 



The upper bound is based on the reconstruction algorithm given in Aanjaneya et al 



(2012) 



Keywords: Metric Graph, Filament, Reconstruction, Manifold Learning, Minimax Esti- 
mation 



1. Introduction 



We are concerned with the problem of estimating the topology of filamentary data structure. 
Datasets consisting of points roughly aligned along intersecting or branching filamentary 
paths embedded in 2 or higher dimensional spaces have become an increasingly common type 
of data in a variety of scientific areas. For instance, road reconstruction based on GPS traces, 
localization of earthquakes faults, galaxy reconstruction are all instances of a more general 
problem of estimating basic topological features of an underlying filamentary structure. The 
recent paper by Aanjaneya et al. (2012), upon which our work is based, contains further 



applications, as well as numerous references. To provide a more concrete example, consider 
Figure [TJ The left hand side displays raw data portraying a neuron from the hippocampus 
of a r at (|Gulyas et aL 1999). The data were obtained from NeuroMorpho.Org (Ascoli 
et al.[ |2007[ ). The right hand side of the figure shows the output of the metric graph 
reconstruction obtained using the algorithm analyzed in this paper, originally proposed by 
Aanjaneya et al. (2012). The reconstruction, which takes the form of a graph, captures 



perfectly all the topological features of the neuron, namely, the relationship between the 
edges and vertices, the number of branching points and the degree of each node. 

Metric graphs provide the natural geometric framework for representing intersecting 
filamentary structures. A metric graph embedded in a D-dimensional Euclidean space 
(D > 2) is a 1-dimensional stratified metric space. It consists of a finite number of points (0- 
dimensional strata) and curves (1-dimensional strata) of finite length, where the boundary 
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of each curve is given by a pair (of not-necessarily distinct) vertices (see the next section 
for a formal definition of a metric graph) . 

In this paper we study the problem of reconstructing the topology of metric graphs from 
possibly noisy data, from a statistical point of view. Specifically, we assume that we have a 
sample of points from a distribution supported on a metric graph or in a small neighborhood 
and we are interested in recovering the topology of the corresponding metric graph. To this 
end, we use a slightly modified version of the metric graph reconstruction algorithm given in 



Aanjaneya et al. (2012). Furthermore, in our theoretical analysis we characterize explicitly 



the minimal sample size required for perfect topological reconstruction as a direct function 
of parameters defining the shape of the metric graph, introduced in Section [2] This leads 
to an upper bound on the risk of topological reconstruction in dimension D = 2, which we 
conjecture hold also in higher dimension. Finally, we obtain a lower bound on the risk of 
topological reconstruction, valid in arbitrary dimensions and for data observed exactly along 
the metric graph. The lower bound almost matches the derived upper bound, indicating 
that the algorithm of |Aanjaneya et"aL (2012) behaves nearly optimally. 



Outline. In Section [2] we formally define metric graphs, the statistical models we will 
consider and the assumptions we will use throughout. We will also describe several geometric 
quantities that are central to our analysis. Section [3] contains detailed analysis of the 
performance of algorithm of Aanjaneya et al. ( |2012[ ) for metric graph reconstruction, under 
modified settings and assumptions. In Section [4] we derive lower and upper bounds for 
the minimax risk of metric graph reconstruction problem. Two examples are considered in 
Section [5j The first is map reconstruction and the second is the neuron example mentioned 
above. In Section [6] we conclude with some final comments. 



Related Work. The work most closely related to ours is Aanjaneya et al. (2012) which 
was, in fact, the motivation for our work. The algorithm we analyze is in fact a minor 
modification of the metric graph reconstruction algorithm analyzed by those authors. From 



the theoretical side, we replace the key assumption in Aanjaneya et al. (2012) of the sample 



being a (e, ^-approximation to the underlying metric graph, by the milder assumption of 
the sample being dense in a neigborhood of the metric graph. Metric graph reconstruction 
is related to the problem of estimating stratified spaces (basically, intersecting manifolds). 
Stratified spaces have been studied by a number of authors such as Bendich et al. ( 2010[ 



m 



2012); Bendich (2008). A spectral method for estimating intersecting structures is given 



Arias-Castro et al. 



(2011). There are a variety of algorithms for specific problems, for 



example, see Ahmed and Wenk (2012); Chen et al. (2010) for the reconstruction of road 
networks. Finally, Chernov and Kurlin (2013) derived an alternative algorithm that uses 
ideas from homology. 



2. Background and Assumptions 



The assumptions in Aanjaneya et al. (2012) lead to a reconstruction process that is aimed 



at capturing the intrinsic structure of the data and is somewhat oblivious to its extrinsic 
embedding. By considering data embedded in the Euclidean space and focusing on the 
topological aspect, we show that weaker assumptions are needed to guarantee a correct 
reconstruction. 
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Figure 1: Left: A neuron from the hippocampus of a rat. Right: A metric graph recon 
struction of the neuron. The data are from NeuroMorpho.Org (Ascoli et al. 



2007). 



In this section we provide background on metric graph spaces and describe the assumptions 
and the geometric parameters that we will be using throughout. 

Informally, a metric graph is a collection of vertices and edges glued together in some 
fashion. Here we state the formal definitions of path metric space and metric graph. For 



more details see Aanjaneya et al. (2012) and Kuchment (2004). 



Definition 1 A metric space (G, do) is a path metric space if the distance between any 
pair of points is equal to the infimum of the lengths of the continuous curves joining them. 
A metric graph is a path metric space (G,dc) that is homeomorphic to a 1-dimensional 
stratified space. A vertex of G is a 0-dimensional stratum of G and an edge of G is a 
1-dimensional stratum of G. 

We will consider metric graphs embedded in MP . Note that, if one ignores the metric 
structure, namely the length of edges and loops, the shape or topology of a metric graph 
(G, do) is encoded by a graph, whose vertices and edges correspond to vertices and edges of 
G. Since we allow for two vertices to be connected by more than one edge we are actually 
dealing with pseudographs. We recall that an undirected pseudograph (V, E) is a set of 
vertices V, a multiset E of unordered pairs of (not necessarily distinct) vertices. To a given 
pseudograph we can associate a function / : E — > V x V, which, when applied to an edge 
e G E, simply extracts the vertices to which e is adjacent. Thus, if ei, e-i G E are such that 
/( e i) = f( e 2), then e\ and e2 are parallel edges. Similarly, if e G E is such that /(e) = {v, v} 
for some v G V, then e is a loop. For each pair (u, v) G V X V, let u(u, v) = \f~ 1 ({u, v})\ if 
{u, v } G E and otherwise. In particular, v{u, v) is the number of edges between u and v 
(or loops if u = v). 
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We say that a metric graph reconstruction algorithm perfectly recovers the topology of 
G if outputs a pseudograph isomorphic to the pseudograph representing the topology of G. 
We now define some key quantities regarding the structure of a metric graph. We start 



with the general definition of condition number. See Chazal and Lieutier (2006) and Niyogi 



et al. (2008) 



Definition 2 The condition number of a 1- dimensional manifold M embedded in M. D , with 
boundary {v,v'}, is the largest number r such that the open normal bundle about M\{v,v'} 
of radius r is embedded in M. D for every r < r . 

The condition number is a measure of the curvature of a manifold. A manifold with large 
condition number does not come too close to be self-intersecting. For example the condition 
number of an arc of a circle is equal to its radius. Each edge of a metric graph (G, do) can 
be seen as a 1-dimensional manifold with boundary. Let the local condition number be the 
minimum of the condition numbers associated to the edges of a metric graph. To control 
points far away in the graph distance, but close in the embedding space, we define 

Ab = {(x, x) £ G : dc{x, x') > min(6, ra)} 

and we define the global condition number as the infimum of the Euclidean distances among 
pairs of point in A^, i.e. fnf^ \\x — x'\\. 

When 2 edges intersect at a vertex v they create an angle, where the angle between 
two intersecting curves is formally defined as follows. Suppose that e\ and e2 intersect at 
x. Let B(x,e) be the D— dimensional ball of radius e centered at x. Let ^i(e) be the line 
segment joining the two points x and B(x,e)f]ei. Let ^(e) be the line segment joining 
the two points x and B(x, e) ei- Let a e (ei, e^) be the angle between i\{e) and The 
angle between e\ and ei is a(ei,e2) = lim e „>o a 6 (ei, ci). We assume that, for each pair of 
intersecting edges e\ and e2, the angle a{e\,e2) is well-defined. 

Let (G, da) be a metric graph and, for a constant a > 0, let Gffiu = {y : inf^gc | \x—y\ \ < 
a}, the a tube around G. In particular, if a = 0, then, trivially, G © a = G. Notice that 
GfficT is a set of dimension D if a > 0. The problem of metric graph reconstruction consists 
of reconstructing a metric graph G given a sample {yi, . . . ,y n } = Y C G(Bo~ endowed with a 
distance dy, which could be the .D-dimensional Euclidean distance or some more complicate 
distance. If a = we say that the sample Y is noiseless, while if a > 0, we say that Y is a 
noisy sample. 

We will use the assumption that the sample Y is sufficiently dense in G © a with respect 
to the Euclidean metric, as formalized below. 

Definition 3 The sample Y = {y%, . . . , y n } C G © a C M D is ^-dense in G © c if for every 
x G G © a, there exists a y € Y such that \\x — y\\ < |. 

Note that this standard assumption is necessary for any recovery guarantee. In this 
paper we are interested in characterizing how dense a sample needs to be in order to 
guarantee, with high probability, perfect recovery of the topology of a metric graph. While in 



our analysis we will mainly rely on the assumption of a dense sample, Aanjaneya et al. (2012 ) 
used the more refined but stronger assumption of the sample being an (e, ^-approximation, 
which we recall. 
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Definition 4 Given positive numbers £,R, we say that (Y, oIy) is an (e, R)- approximation 
of the metric space (G, d G ) if there exists a correspondence C C G x Y such that 

(x,y), {x',y') G C,mm(d G (x,x'),dy(y,y')) < R =>• \d G (x,x) - d Y (y,y')\ < e. (1) 



As shown in Aanjaneya et al. (2012), the (e, i2)-approximation assumption is sufficient, 
for appropriate choice of the parameters e and R, to recover not only the topology of a 
metric graph (G,d G ), but also its metric d G with high accuracy. However, when compared 
to the dense sample assumption, it demands a larger sample complexity to achieve accurate 
topological reconstruction. We illustrate this claim with a simple example. 

We need one more definition. The Rips-Vietoris graph 1Z K (Y) is the graph with vertex 
set Y and an edge exists between yi and yj iff \ \yi — yj\ \ < k. 

Example 1 Figure^ shows a metric graph (G,d G ) embedded inM 2 (top left) consisting of 
three edges of length 1 that intersect at a common vertex O forming 3 angles of width 2tt/3 
and a S/2-dense sample Y = {yi, . . . , y n } in G (top right), for some 5 > 0. 



The authors of Aanjaneya et al. (2012) define g?y(2/i,2/2) t° be the length of the shortest 
path connecting y\ and yi in the Rips-Vietoris graph 1Z K (Y). Their key assumption is that 
(Y, d Y ) is an (e, R)- approximation of(G,d G ). 

We would like to keep k as small as possible with the constraint k> 5, to guarantee that 
1Z K (Y) is a connected graph. Thus we set k = 5. In this example, the values e = 1/128 



and R = 5/17 satisfy Theorems 1 and 2 in Aanjaneya et al. (2012) for topological and 
metric reconstruction. The reconstruction algorithm, described here in Section^ provides 
a reconstructed graph G (Figure^ bottom right) that is isomorphic to G and a distance <ig 
that approximates d G . 

Consider the bottom left plot of Figure^ where {x, x'} C G, {y, y' , z, z'} C Y, y') = 
R, d Y {z, z') = 5 and (x, y), (x', y') G C . Note that \\0 — z\\ = <5/\/3 and 

\d G (x,x') - d Y (y,y')\ = 

Therefore (Y, d Y ) is a (1/128, 5/ 17) -approximation of the metric graph (G,d G ) if 5 < 
^jijg = 0.00677. This strong condition guarantees topological and metric reconstruction. 
By focusing on the topological aspect, we show that minor modifications of the algorithm 
guarantee a correct reconstruction with weaker assumptions on the sample Y. In particular, 
for this example, a 0.08/2-dense sample is sufficient. 

Throughout our analysis we restrict the attention to metric graphs embedded in M. D 
that satisfy the following assumptions: 

Al The graphs are free of nodes of degree 2 (though they may contain vertices of degree 
1 or 3 and higher). 

A2 Each edge is a smooth embedded sub-manifold of dimension 1, of length at least b > 
and with condition number at least r > 0. 

A3 Each pair of intersecting edges forms a well-defined angle of size at least a > 0. 
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Figure 2: Top left: metric graph (G, do) formed by 3 edges of length 1 intersecting at a 
single vertex and forming 3 angles of width 2tt/3. Top right: sample YcGffiO. 
Bottom left: example of the distortion between cIg(x,x') and d^(y,y'). Bottom 
right: reconstructed graph G. 
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A 4 The global condition number is at least £ > 0. 
Assumptions Al and A2 allow us to consider each edge of a metric graph as a single smooth 



curve. This assumption is common in the literature. See Chen et al. (2010) for a road 
reconstruction algorithm that allows for corners within an edge. 

Let G be the set of metric graphs embedded in H D that satisfy assumptions Al, A2, A3 
and A4. We consider the collection V of probability distributions supported over metric 
graphs (G, do) in Q having densities p with respect to the length of G bounded from below 
by a constant a > 0. We are interested in bounding the minimax risk 



Rn = inf sup P n [G ^G , (2) 

G Pev v ' 

where the infimum is over all estimators G of the topology of (G, da), the supremum is over 
the class of distributions V for Y and G 9^ G means that G and G are not isomorphic. 
In Section [4] we will find a lower bound for R n and an upper bound for the special case of 
metric graphs embedded in M 2 . 

We conclude this section by summarizing the many parameters and symbols involved in our 
analysis. See Table [TJ 



Symbol 


Meaning 


(G, da) 


metric graph 


a 


smallest angle 


b 


shortest edge 


T 


local condition number 


£ 


global condition number 


G 


set of metric graphs embedded in ~R D , satisfying A1-A4 


V 


set of distributions on G with density bounded by a > 


G®a 


a tube around G 


Y 


sample, subset of G © a 


5 


Y is a (5/2-dense sample 




R-V graph on Y of parameter k 



Table 1: Summary of the symbols used in our analysis. 



3. Performance Analysis for the Algorithm of Aanjaneya et al. (2012) 

In this section we will study the performance of the metric graph reconstruction algorithm 
of Aanjaneya et al. (2012), under assumptions A1-A4 and with the choice of parameters 
adapted to our setting. In Section [4] we will use these results to derive upper bounds on 
the minimax rate for topology reconstruction for noiseless samples. Our analysis applies 
to metric graphs embedded in M 2 , though we conjecture that it carries over to arbitrary 
dimensions. 
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The metric graph reconstruction algorithm is presented in Algorithm [T] 

Input: sample Y, dy- 

1: Labeling points as edge or vertex 

2: for all y E Y do 

3: S y ^B(y,r + S)\B(y,r) 

4: deg r (y) 4— Number of connected components of Rips-Vietoris graph TZg(S y ) 

5: if deg r (y) = 2 then 

6: Label y as a edge point 

7: else 

8: Label y as a preliminary vertex point. 

9: end if 
10: end for. 

11: Label all points within Euclidean distance p n from a preliminary vertex point as vertices. 
12: Let E be the point of Y labeled as edge points. 
13: Let V be the point of Y labeled as vertices. 
14: Reconstructing the graph structure 

15: Compute the connected components of the Rips-Vietoris graphs 1Zg(E) and IZg (V). 

16: Let the connected components of lZg(Y) be the vertices of of the reconstructed graph G. 

17. Let there be an edge between vertices of G if their corresponding connected components 

in Ttg(y) contain points at distance less than S from the same connected component of 1Z$(E). 
Algorithm 1: Metric Graph Reconstruction Algorithm 



The inner radius of the shell at Step 3 and the width of the expansion at Step 11 are 
parameters the user has to specify. Note that in the original algorithm of Aanjaneya et al. 



(2012) the expansion of Step 11 was performed using the distance dy, since an embedding 
of the space was not required. 

The algorithm takes a (possibly noisy) sample Y from a metric graph G and a distance c?y 
defined on Y and returns a graph G that approximates G. We will analyze the algorithm 
considering first the Euclidean distance and then specifying a more complex distance g?y on 
the sample Y. 



Before providing the details of our analysis, we demonstrate the worst case scenario of 
two edges e\ and 62, with minimum condition number r forming the smallest angles a at 
vertex x. See Figure [3] (left). Note that ei and e2 are simply arcs of circles of radius r. O 
and O' are the centers of the circles associated to edges e\ and e2- It is easy to see that 
angle OxO' has width tt — a. 

Let Y be a noisy sample of G. In other words Y is a subset of G © a, the tube of radius 
a > around the metric graph G. See Figure [3] (right). 

For a > 0, the smallest angle formed by the inner faces of the tube around the metric 
graph is 

2(T-a) 2 -4r 2 cos 2 (a/2) 

a = tt — arccos -. -k , 

2(r — a) z 

where we applied the cosine law to the triangle OsO' and the fact that angle O'sO' has 
width tt — a' . Note that if a = then a' = a. 
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Figure 3: Left: edges e\ and ei with minimum condition number r forming the smallest 
angles a at vertex x. Right: same metric graph with a tube of radius a around 
it. 



Let Q be the midpoint of segment OO' and let T be the intersection point of 00' and 
edge e\. It can be shown that 



xOO' = a/2, 
TxQ = a/4. 



(3) 
(4) 



Analysis of Algorithm [T] with Euclidean distances 

Consider the framework introduced in Section [2] for metric graphs embedded in M 2 and 
Algorithm 1 with input (Y, dy = || • ||), that is, the Euclidean distance is used at every 
step. The algorithm requires the specification of r, the inner radius of the shell, and pn, 
the parameter governing the expansion of Step 11. We set 



I S 5 

r =max - +Tsin(a/2) - (r - a) sin(a'/2) + - . , 

2 2sm(a/4) 



t 2 + (r - a) 2 - 2r(r - a)\ 1 



2(r-a) 
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and 



f + rsin(a/2) - (r - a) sin(a'/2) + r 



| + r sin(a/2) - (r - a) sin(o//2) + 



if a' > 7r/2 and 

r . , / \sin(2a'— 7r) 

o + r < (r — ct — ■ ? ,n 



otherwise. 



This choice is justified in the proof of Proposition [5} 
Proposition 5 IfYisi- dense in G © a and 

< a < t[1 -cos(a/2)], 
0<r + 5<£-2a, 



niin(6, ar) — (a — a')r 
sm | ^ ^ '— 1 > < 



r+pn+38/2 
T—a 



■ PTTTS +Pll+S/^ 
sin(a'/4) r ' 



if a' > 7r/2 and 

jr j * / \ sin(2a'— 7r) 

d + r < It — a) — ■ , ,\ 

— ' sin(a') 



(5) 
(6) 

(7) 



otherwise 



then the graph G provided by Algorithm 1 (input: Y, || • \\) is isomorphic to G. 

Proof First, note that condition ^ is a bound on the radius of the tubular noise around 
the metric graph. Consider the case depicted in Figure [3] (right). The radius of the tube 
must be less than half of the maximum distance between the represented edges \\Q — T\\ = 
\\0' — T\\ — \\0' — Q\\ = t — tcos(q/2). Next, condition ^ guarantees that points of G 
which are far apart in the metric graph distance do and close in the Euclidean distance do 
not interfere in the construction of the shells at Steps 3-4. 

The rest of the proof involves condition Q. Since the sample is |— dense in the tube, 
there is at least a point y G Y inside the ball of radius | centered at any vertex x £ G. 
When we apply Algorithm 1 we want to be sure that y is labeled as a vertex, i.e. the number 
of connected components of the shell around y is different than 2 (Steps 3-4). The worst 
case is depicted in Figure [4] (left), where x is the vertex of the minimum angle a, formed 
by two edges, e\ and e2- 

The inner faces of the tube of radius a around e\ and e2 form an angle of width a' at 
vertex s, as described at the beginning of this section. Let u and v be the two points on 
the faces of the tube such that they are equidistant from x and \\u — v\\ =5. Since at Step 
4 we construct a lZs(S y ) graph to determine the number of connected components of the 
shell S y and we want y to be a vertex, we choose r, the inner radius of the shell, so that if 
u, v £ Y then r > max{dY(?A u )i d-Yiv, v)}. This guarantees that Vti,t2 6 Y with ti around 
edge ei 3 ti around edge e2 with dY(y,ti) > r, dy(y,£2) > r, we have dy^i,^) > S, i.e. t\ 
and t2 belong to different connected component of the shell around y at Step 4. 
We bound the distance between y and u by ||y — x|| + \\x — s\\ + \\s — u\\, where, using ([3]), 



Q\\ - \\ s - Q\\ =rsin(a</2) - (r - a) sin(o//2) 



and using Q, 



2sin(a//4) 



(8) 
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Figure 4: Left: edges e\ and ei with minimum condition number r forming the smallest 
angles a at vertex x. Right: The distance \\F — G\\ between the two connected 
components of the shell around an edge point y' must be greater than 8. 



Therefore we require that r, the inner radius of the shell at Step 4 satisfies 



5 _ 5 
r ~ 2 + I|X " S|I + 2sin(a'/4) 



= : n. 



(9) 



Another condition on r arises when we label edge points far from actual vertices. See Figure 
[4] (right). The distance \\F — G\\ between the two connected components of the shell around 
an edge point y' must be greater than 5. By the cosine law, cos(/3) 



(i±d!±(i=£)!zr! and 



2(T 2 -<7 2 ) 



= 2(r - a) sin (p) 
= 2(r — a) sin I arccos 



2(r 2 + a 2 



2(r 2 -a 2 ) 



2{r-a)\l 



2(t 2 + C j 2 ) 



2(r 2 - a 2 ) 

Therefore the condition II F — G\\ > 5 can be written as 



r > 



2(r 2 + a 2 ) -2(r 2 -a 2 )Wl 



5 



2(r 



--: r 2 . 



(10) 



By Q and (10) we set r = max(ri,r2). 

The outer radius of the shell at Steps 3-4 has length r + S. This guarantees that when 
the shell around an edge point intersects the tube around G there is at least a point y € Y 
in each connected component of the shell, since Y is |-dense in G © a. 
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Then, for any edge of G we want to be sure that there is at least one point in Y which 
is labeled as an edge point. We start by considering two cases, depicted in Figure [5] 

(a) The segment of length r+5 orthogonal to the face of the tube around edge e\ intersects 
the face of the tube around edge ei at a point z. This happens when a! < \ or when 

(b) There are no segments of length r + 5 orthogonal to the face of the tube around 
edge ei that intersect the tube around edge e2- This happens when a' > f and 
5 + r < (r — a) bm }^ a i'^ ■ I n this case we simply consider the segment of length r + 5 
from s to a point z on e2- 




Figure 5: Left: case (a). Right: case (b). 

Suppose z £ Y. Among the points that might be labeled as vertices at Step 6 because 
of their closeness to vertex x, z is the furthest from x, since the shell around z is tangent to 
the tube around e±. At Step 11, in order to control the labelling of the points in the tube 
between y and z we would like to label all the points in {y' £ Y : \\y' — y\\ < \\y — z\\} as 
vertices. To simplify the calculation we use the following bound 



where, using Q, 



\y — z\\ < \\y — x\\ + \\x — s\\ + \\s 



r+8 



z\\ < sz := <; case ( & ) (ii) 

I r + 5 case (b) 



Therefore we set pu := — h||x — s|| + sz and at Step 11 we label all the points in {y' € Y : 
\\y' — y\\ < Pn and y is labeled as vertex at Step 6 } as vertices. 

If z is actually labeled as a vertex at Step 6, then through the expansion of Step 11, all the 
points at distance not greater than pn from z are labeled as vertices, too. 

Finally we determine the length of edge e2 so that there is at least a point in the tube 
around 62 labeled as an edge point after Step 11. Consider the worst case in which e± and 
e2 are forming an angle of size a at both their extremes x and x'. See Figure p) 
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Figure 6: Edges e\ and e2, forming an angle of size a at both their extremes x and x'. 

All the points y' £ Y such that \\y' — z\\ < pu or \\y' — z'\\ < pu might be labeled as 
vertices. When we construct 1Z(K)s and TZ(Y)$ at Step 15 the two sets of vertices on &2 
must be disconnected and there must be at least an edge point on ei- A sufficient condition 
is that the length of edge e2 is greater than 2{a\ + 02 + 03) + 04, where 

• a\ is the length of the arc of e2 formed by the projections of lines Ox and Os on e2, 

• a 2 is the length of the arc of e2 formed by the projection of the chord of length sz, 

• 03 is the length of the arc of e2 formed by the projection of the chord of length pu, 

• (Z4 is the length of the arc of e2 formed by the projection of the chord of length 5. 

Note that e2 = 2rarcsin ^ H 3 ^ II ^ = aT jj U ^ [ n general the shortest edge b might be shorter 
than it. So we require 

min(6, ar) > 2{a\ + 02 + 03) + 04. (12) 

By simple properties involving arcs and chords we have 

f a — a'\ (si, 
a\ = — - — r, 02 = 2r arcsm 



2 J \2(r-a 

( Pu \ ( & 

03 = 2r arcsin I — , 04 = 2r arcsin 



2{r-a)) ' " \2{r-a) 
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Since the arcsin is superadditive in [0, 1] we require the stronger condition 
min(6, err) — (a — a )r > 2r arcsin 

which holds if 

/ min(6, or) — (a — o,')t\ 2sz + 2p\\ + 5 

sm > r . 

\ 2r J 2(r-a) 

If this condition is satisfied then the graph is correctly reconstructed at Steps 15-17: every 
connected component of TZs(¥) corresponds to a vertex of G and every connected compo- 
nent of 1Z$ (E) corresponds to an edge of G. ■ 



2sz + 2pn + 5 



2(r 



0~ 



Analysis of Algorithm 1 with Rips graph distances. 

So far we have applied the reconstruction algorithm using the Euclidean distance at every 
step. Now we explore the consequences of using a different metric, a strategy closer to the 



original idea of Aanjaneya et al. (2012). In general, we obtain a less strict condition on 5 



expressed through a more complicated expression than the one of Proposition [5} Let k > 
and consider the Rips-Vietoris graph 1Z K (Y) with vertex set Y and edges connecting all 
the pairs of vertices at distance not greater than k from each other in M 2 . The metric dy 
is defined as the shortest path distance on 7Z K (Y). Note that the distance dy is used at 
Steps 3-5, while at Step 11 we still consider the Euclidean distance for the expansion. The 



following result is just a version of Theorem 2 in Bernstein et al. (2000). 



Lemma 6 Let k and 5 be positive, with 28 < k. IfY = {yi, . . . , y n } is ^-dense in G a, 
then for all pairs of data points y,y' G G © a we have 

<h{y, y') < ( 1 + — J d G(Ba (y, y'), 



where do&aiy, y') is the length of the shortest piecewise- smooth path connecting y and y' in 
the Euclidean space, such that it is entirely contained in G® o~. 



The parameters r and pn are now defined as follows: 



: max 



1 + 



25 



+ r sin(a/2) — (r — a) sin(c//2) + 2r arcsin 



5 



4(r - a) sin(a'/4) 



1 + 



25 \ 



2(r 2 + a 2 ) - 2(r 2 - a 2 ) 



2{r-a) 



and 



Pn 



f + r sin(a/2) - (r - a) sin(a'/2) + r 



M 

2 



+ r sin(a/2) - (r - a) sin(a7 2 ) + 



sin(o'/4) 

This choice is justified in the proof of Proposition [7j 



if a' > vr/2 and 

K - \ T °> sin(a') 

otherwise. 
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Proposition 7 IfYisi- dense in G © a and 



< a < r(l -cos(a/2)), (13) 
25 < k < £ - 2(7, (14) 



min(6, err) — (a — a')r 
sni I ^ ^— ^ ^ 1 > < 



( pn+3<5/2+r 
r— tj 



j/ a' > 7r/2 and 

,n(2a / 
sin(a') 



^ l \ sin(2a'— 7r) ,_ 

« < (r - g) J n(a n ; (15) 



( /4) otherwise 



then the graph provided by Algorithm 1 (input: Y, dyj isomorphic to G. 

Proof The steps are similar to those in the proof of Proposition [5] The condition a < 
r(l — cos(o/2)) is unchanged. The condition k < £ — 2a has the same meaning of condition 
Q. Now we define the inner radius of the shell. See Figure [4j Since at Steps 3-5 we use the 
distance dy on the Rips graph TZ K , we need an upper bound on shortest path connecting y 
and u inside the tube. This is given by \\y — x\\ + \\x — s\\ + s~u, where the last term is the 
length of the arc connecting s and u; s~u = 2(r — a) arcsin gfezpj • Therefore, using ^ and 
Lemma [6j condition ^ becomes 

( 2S \ f 5 „ „ / ^ • / 5 \\ 
r > 1 H — h \\x — s + 2(r — a) arcsin — ; — — — =: n. (16 

- V k/\ 2 V 4 ( T - CT ) sin («/ 4 )// 



Similarly condition (10) becomes 
2 A 



r > [ 1 + 



;\^ + ^- 2 ^-^V V2(^ 




2! ~- + rx-' ) - 2( T- - a- ) , / J - ( - - - ) =: r 2 (17) 



and we set r = maxjri, r2}. 

Then consider the following two cases, depicted in Figure [7j 

(a) The segment of length k orthogonal to the face of the tube around edge e\ intersects 
the face of the tube around edge e 2 at a point w. This happens when a' < | or when 

> C >, sin(2a'-7r) 

(b) There are no segments of length k orthogonal to the face of the tube around edge 
ei that intersect the tube around edge e 2 . This happens when a' > ^ and k < 

(r — a) hl °-^" a /) 7r ' ) ■ in this case we simply consider the segment of length k from s to a 
point w on e 2 . 

Then we consider point z on the face of the tube around edge e 2 such that ||u; — z\\ = 
r + 5 — k. Suppose z £ Y. Among the points that might be labeled as vertices at Step 6 
because of their closeness to vertex x, z is the furthest from x, since the shell (dy distance) 
around z is tangent to the tube around e%. At Step 11, in order to control the points in the 
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Figure 7: Left: case (a). Right: case (b). 

tube between y and z we label all the points in {y 1 G Y : \\y — y\\ < \\y — z\\} as vertices. 
To simplify the calculation we use the following bound 



\y — z\\ < \\y — x\\ + \\x — s\\ + \\s — w\\ + \\w 
5 
2 



< ^ + 11^ — s\\ + lis — w\\ + r + 5 



where, similarly to (11) 



case a 



||s-to|| < ^ := < sin (°'/ 4 ) 

case (b) 

Therefore we set p\\ '■= — h ||x — s|| + sw + r + 5 — k and at Step 11 we label all the points 

in {y' G Y : \\y' — y\\ < pn and y is labeled as vertex at Step 6 } as vertices. 

Finally we determine the length of edge e2 so that there is at least a point in the tube 
around e2 labeled as an edge point after Step 11. Consider the case of Figure [8] in which 
is forming an angle of size a at both its extremes x and x' . 

All the points y' G Y such that ||y' — z\\ < pn or \\y' — z'\\ < pn might be labeled as 
vertices. When we construct 1Z(K)s and 1Z(Y)s at Step 15 the two sets of vertices on e2 
must be disconnected and there must be at least an edge point on e% . A sufficient condition 
is that the length of edge e<i is greater than 2{a\ + a2 + 03 + 04) + 05, where 

• ai is the length of the arc of e2 formed by the projections of lines Ox and Os, 

• a% is the length of the arc of e2 formed by the projection of the chord of length sw, 

• 03 is the length of the arc of e2 formed by the projection of the chord of length r+5 — k, 

• 04 is the length of the arc of e2 formed by the projection of the chord of length pn, 

• 05 is the length of the arc of e2 formed by the projection of the chord of length 5. 



10 



Statistical Analysis of Metric Graph Reconstruction 




Figure 8: Edges e\ and e2, forming an angle of size a at both their extremes x and x' . 



Note that ei = 2r arcsin ^ ^ 2 t " ) = ar kut m g enera l the shortest edge 6 might be shorter 
than it. So we require 



min(6, ar) > 2{a\ + 02 + 03 + 04) + 05. 
By simple properties involving arcs and chords we have 



(18) 



a — a \ , I sw 



03 = 2r arcsin 



r + 5 — k 
2(t-<t) 



04 = 2r arcsin 



P11 



05 = 2r arcsin 



2(r-a)y' V2(t-ct) 

Since the arcsin is superadditive in [0, 1] we require the stronger condition 

/1 \ / „ . {2sw + 2(r + 5-K)+2 Pu + 8 

mm(o, ar) — (a — a )t > 2t arcsin 1 



which holds if 



sin 



2(r-a) 

min(6, err) — (a — a')r\ 2sw + 2pu + 35 + 2(r — k) 



2t 



> 



2{t-u) 
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If this condition is satisfied then the graph is correctly reconstructed at Steps 15-17: every 
connected component of TZs(V) corresponds to a vertex of G and every connected compo- 
nent of 72-5 (E) corresponds to an edge of G. ■ 



Remark 8 We believe that Propositions^ and^ are still valid in higher dimensions. That 
is, the worst case treated in M? is the worst case even in higher dimensions. 

4. Minimax Analysis 

In this section we derive lower and upper bound for the minimax risk 

R n = inf sup P n (g^g), (19) 



where, as described in Section |2j the infimum is over all estimators G of the metric graph 
G, the supremum is over the class of distributions V for Y and G 9^ G means that G and 
G are not isomorphic. 

4.1 Lower Bound 

To derive a lower bound on the minimax risk, we make repeated use of Le Cam's lemma. 



See, e.g., Tsybakov (2008). Recall that the total variation distance between two measures 
P and Q on the same probability space is defined by TV(P, Q) = sup ^4 \ P(A) — Q(A)\ where 
the supremum is over all measurable sets. 

Lemma 9 (Le Cam) Let Q be a set of distributions. Let 9{Q) take values in a metric 
space with metric p. Let Q\, Q 2 G Q be any pair of distributions in Q. Let Y±, . . . , Y n be 
drawn iid from some Q G Q and denote the corresponding product measure by Q n . Then 

inf sup E Qn \p(9, 9(Q))] > ip(0(Qi), 0(Q 2 ))(1 - TV(Qx, Q 2 )) 2n (20) 

where the infimum is over all the estimators. 

Below we will apply Le Cam's lemma using several pairs of distributions. Any pair 
Qi,Q2 will be associated with a pair of metric graphs Gi,G 2 G Q. We will take 9(Qi) 
and 9(Q 2 ) to be the classes of graphs that are isomorphic to G\ and G 2 - We will set 
p(9(Qi),9(Q 2 )) = if G\ and G 2 are isomorphic and p(9(Qi),9(Q 2 )) = 1 otherwise. 

Theorem 10 In the noiseless case, for b < bo(a), a < ao(a), £ < £o(a), r < ro(a), where 
bo(a), ao(a), £o(°) an d T o( a ) are constants which depend on a, a lower bound on the minimax 
risk for metric graph reconstruction is 

Rn > ^exp(-2amin{&,2sin(a/2),£,27TT}n). (21) 
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Proof We consider the 4 parameters separately. 
Shortest edge b 

Consider the metric graph G\ consisting of a single edge of length 1+b and another metric 
graph G 2 with an edge of length 1 and an orthogonal edge of length b glued in the middle. 
See Figure[9} The density on G\ is constructed in the following way: on the set W\ = G\\G 2 



A 



Wi 



A 



W 2 



Figure 9: G\ and G 2 

of length b we set pi(x) = a and the rest of the mass is evenly distributed over the remaining 
portion of G\. Similarly, for G 2 we set p 2 {x) = a on W 2 = G 2 \G\, which correspond to the 
orthogonal edge of length b. We evenly spread the remaining mass. The two densities differ 
only on the sets W\ and W 2 . Therefore 

TV(pi,p 2 ) < ab 

and, by Le Cam's lemma, 

R n > 1 (l-ab) 2n > 1 e - 2abn 
for all b < 6o(a), where bo(a) is a constant depending on a. 
Smallest angle a 

Now consider the metric graphs of Figure [10} G3 consists of two edges of length 2 forming 
an angle a and a third edge of length 1 + 2 sin(o/2) glued to the first two. G4 is similar: an 
edge of length 2sin(a/2) is added to complete the triangle, while the edge on the left has 
length 1. 




Figure 10: G3 and G4 

As in the previous case we set p%{x) = a on W3 = G^G^, p^x) = a on W4 = G^G^, 
and spread evenly the rest of the mass. The total variation distance is 

TV(p 3 ,P4) < 2a sin (| 
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and, by Le Cam's lemma, 

Rn > -(1 - 2a sin (a/2)) 2 ™ > l e - 4asin(cl/2)n 
8 8 

for all a < oto(a), where ao(a) is a constant depending on a. 
Global condition number £ 

We denned the global condition number as the shortest euclidean distance between two 



points far away in the graph distance. In Figure 11 we have metric graph G5 formed by a 
single edge of length 1 and metric graph G§ consisting of two edges of length 0.5, £ apart 
from each other. 

W 5 £ W 6 

A 1 1 A 1 1 1 1 



Figure 11: G5 and Gq 

Again, we set ps(x) = a on W5 = G^\Gq, p%{x) = a on W% = Gq\G^ and evenly spread 
the rest. We obtain 

TV(p 5 ,p 6 ) << 

and, by Le Cam's lemma, 

R n > -(l-<) 2n > - e - 2 < n 
for all £ < £0(0), where £o(a) is a constant depending on a. 

Local condition number r 

The local condition number r is the smallest condition number of the edges forming the 



metric graph. In Figure 12 we have metric graph G7 consisting of a loop of radius r attached 



to an edge of length 1 and metric graph Gg, a single edge of length 1 + ttt. 

W 8 




W 7 



Figure 12: Gj and G& 

As in the previous cases pr(x) = a on Wj = G 7 \Gs and ps(x) = a on Wg = Gs\Gy. It 
follows that 

TV(p 7 ,p 8 ) < 2airr 
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and, by Le Cam's lemma, 



Rn > 1 (1 - 2aiTT) 2n > 1 e - 4 ^ rn 



for all r < tq(o), where to (a) is a constant depending on a. 



4.2 Upper Bound 

In this section we use the analysis of the performance of Algorithm [T] to derive an upper 
bound on the minimax risk. We first need two useful lemmas. 



Lemma 11 (5.1 in Niyogi et al. (2008)) Let {Ai} for i = 1, . . . , I be a finite collection 
of measurable sets and let n be a probability measure on \j\ =x Ai such that for all 1 < i < I, 
we have n(Ai) > 7. Let x = {x±, . . . ,x n } be a set of n i.i.d. draws according to [i. Then if 

n>- ( log / + log \\ 

7 V 

we are guaranteed that with probability > 1 — A, the following is true: 

Vi, x n Ai / 0. 

Recall that the covering number C{8) of a set is the smallest number of balls of radius 5 
required to cover the set. The packing number P(S) is the largest number of non-overlapping 
balls of radius 5 whose centers are contained in the set. 



Lemma 12 (5.2 in Niyogi et al. (2008)) For every 5 > 0, 



P(25) < C(25) < P{6). 



By restricting the attention to metric graphs embedded in M. 2 and using Proposition [EJ 
we obtain the following upper bound on R n . 

Theorem 13 In the noiseless case (a = 0), for a < n/2, an upper bound on the minimax 
risk R n is given by 

8 length(G) ( and \ 
R n < j exp 



where 



4 length(G)) ' 



1 f 2sin(a/4) Tsin 2 (a/4) /min{6,crr} 
= - mm < £ . — — , „ — — — — — sm 1 



2 \ 3sin(a/4) + 1 ' sin 2 (a/4) +3sin(a/4) + 1 V 2r 
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Proof In the noiseless case, for a < vr/2, Proposition [5] implies that the graph G can be 
reconstructed from a 5/2— dense sample Y if 



5 < min < £ 



2sin(a/4) 



r sin 2 (a/4) 



'3sin(a/4) + l ' sin 2 (a/4) + 3 sin(a/4) + 1 



sin 



min{6, «t} 



2t 



(22) 



Condition (22) follows from @-([7]) and the simplified forms of r and p\\ for the noiseless 
case. We simply choose 



1 



min < £ 



2sin(a/4) 



r sin 2 (a/4) 



'3sin(a/4) + l sin 2 (a/4) + 3sin(a/4) + 1 



sm 



min{6, ar} 



2r 



and we look for the sample size n that guarantees a 5/2— dense sample with high probability. 
Following the strategy in Niyogi et al. (2008), we consider a cover of the metric graph G 
by balls of radius 5/\. Let {x{ : 1 < % < 1} be the centers of such balls that constitute a 
minimal cover. We can choose A^J A = Bgu(xi) n G. Applying Lemma 11 we find that the 
sample size that guarantees a correct reconstruction with probability at least 1 — A is 



log / + log 



where 



aS 



a length(^4^ 4 ) 

7 > min — — — - > - 

r - i length(G) ~ 4 length(G) 



and we bound the covering number I in terms of the packing number, using Lemma 12 

length(G) „ 8 length(G) 



Therefore if 



n 



length(4 /8 
4 length (G) 



ad 



log 



8 length(G) \ , 1 
+1 ° g A 



(23) 



we have ¥(G 9^ G) < X. Rearranging we have the desired result. 



Note that the upper and lower bounds are tight up to polynomial factors in the param- 
eters r, b, £. There is a small gap with respect to a; closing this gap is an open problem. 

5. Experiments 

Example 2 The Topology of Shadyside. In this example we try to reconstruct the 
main streets of Shadyside, a neighborhood of Pittsburgh. Figure [73] shows the main streets 
(yellow) of Shadyside obtained from Google Maps. There are 16 vertices and 21 edges for a 
total length of 9, 725 m. We imagine taking a walk in Shadyside and recording our position 
at several points in order to reconstruct the topology of the graph formed by the streets. 
Propositions [5| and [?] tell us how often we should record our position. The shortest edge 
has length 120 m and the smallest angle is tt/3 (at Highland & Centre Ave). The streets 
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are mostly straight, with a few exceptions. We assume r = 300m. Also, £ = 90m since 
Ellsworth Ave and Centre Ave become very close in a point far from the vertices. 

Assuming no noise, conditions (|6]) -([7]) simplify to (22) from which we get 5 < 2.17. On 
the other hand, condition (15) is satisfied for 5 = 2.46 and k = 25. Therefore registering 
the position every 2.46 m would guarantee a a correct reconstruction. Algorithm 1 provides 
the estimated graph plotted in red on top of the map in Figure 13, By (23), if we sample 
points uniformly we need 229, 273 points to reconstruct the graph with probability at least 
0.9. 




Figure 13: Shadyside, a neighborhood of Pittsburgh. 



Example 3 A Neuron in Three- Dimensions. In support of Remark [#| we return to 
the neuron example and we try to apply Propositions [5| and [?] to the 3D data of Figure [IJ 
namely the neuron cr22e from the hippocampus of a rat (Gulyds et al., 1999). The data 



were obtained from N 'euro Morpho. Org (Ascoli et al., 2001). The total length of the graph 
is 1750.86/um. We assume the smallest edge has length 100/um, the smallest angle vr/3, 
the local condition number 30 and £ = 50. The conditions of Proposition^ are satisfied for 
5 = 0.54, while those of Proposition^ are satisfied for 5 = 0.62 and k = 25. Figure^Tp shows 
the reconstructed graph. If we sample points uniformly from the original metric graph we 
need 114,435 points to correctly reconstruct the graph with probability at least 0.9. 



6. Conclusion 

In this paper, we presented a statistical analysis of metric graph reconstruction. We de- 
rived sufficient conditions on random samples from a graph metric space that guarantee 
topological reconstruction and we derived lower and upper bounds on the minimax risk for 
this problem. 

Various improvements and theoretical extensions are possible. First, we conjecture 



that our analysis for the performance of the algorithm of Aanjaneya et al. (2012) under 



the choice of parameters we describe and the corresponding minimax rates will extend to 
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higher dimensions. Currently, we are investigating the idea of combining metric graph re- 



construction with the subspace constrained mean-shift algorithm ( Fukunaga and Hostetler 



1975 


Comaniciu and Meer 


2002; 


Genovese et al. 



preliminary results indicate that this mixed strategy works very well under more general 
noise assumptions and with relatively low sample size. 
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